
[4/7] Telemetry Event Emission and Aggregation#327

Open
samikshya-db wants to merge 138 commits into main from
telemetry-4-event-aggregation

Conversation


@samikshya-db samikshya-db commented Jan 28, 2026

Part 4 of 7-part Telemetry Implementation Stack

This layer adds event-driven telemetry emission, per-host shared
aggregation/export, and operation-level error telemetry on top of the
hardened infrastructure shipped in [1/7]–[3/7].

What's in this PR

TelemetryEventEmitter (lib/telemetry/TelemetryEventEmitter.ts)

Per-DBSQLClient event emitter: typed emission methods, respects the
client's telemetryEnabled flag, and swallows all exceptions (logged at
debug level only).
Each emitter bridges into the shared aggregator on the per-host
TelemetryClient.

Event types: CONNECTION_OPEN, CONNECTION_CLOSE (new), STATEMENT_START,
STATEMENT_COMPLETE, CLOUDFETCH_CHUNK, ERROR.
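The emitter contract above (typed emits, enabled-flag gate, swallow-everything) could be sketched as follows. Class and method names here are illustrative, not the driver's exact API:

```typescript
import { EventEmitter } from "node:events";

enum TelemetryEventType {
  ConnectionOpen = "connection.open",
  ConnectionClose = "connection.close",
  StatementStart = "statement.start",
  StatementComplete = "statement.complete",
  CloudFetchChunk = "cloudfetch.chunk",
  Error = "error",
}

class TelemetryEventEmitterSketch extends EventEmitter {
  constructor(private readonly telemetryEnabled: boolean) {
    super();
  }

  // Every typed emit goes through one guarded path: disabled clients are a
  // no-op, and a throwing listener never propagates into driver code.
  private safeEmit(type: TelemetryEventType, payload: unknown): void {
    if (!this.telemetryEnabled) return;
    try {
      this.emit(type, payload);
    } catch {
      // real driver logs at debug level; telemetry must never throw
    }
  }

  emitStatementStart(statementId: string): void {
    this.safeEmit(TelemetryEventType.StatementStart, { statementId });
  }
}
```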

TelemetryClient owns the per-host triad (lib/telemetry/TelemetryClient.ts)

TelemetryClientProvider is a process-wide singleton. Each host gets one
TelemetryClient that owns:

  • DatabricksTelemetryExporter
  • MetricsAggregator
  • CircuitBreakerRegistry
  • FeatureFlagCache

Multiple DBSQLClient instances on the same host share these — breaker
counters and HTTP batches don't fragment per-instance. TelemetryClient
implements IClientContext so the owned components have a stable context
that survives any single DBSQLClient's close. Connection/auth providers
are tracked in a FIFO of registered contexts; the exporter falls through
to the next active one when the head closes.
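The per-host sharing described above hinges on a reference-counted singleton provider. A minimal sketch of that shape (names and the acquire/release surface are assumptions for illustration):

```typescript
class HostTelemetryClient {
  refCount = 0;
  constructor(readonly host: string) {}
  close(): void {
    /* real driver flushes and releases owned resources here */
  }
}

class TelemetryClientProviderSketch {
  private static instance?: TelemetryClientProviderSketch;
  private clients = new Map<string, HostTelemetryClient>();

  static getInstance(): TelemetryClientProviderSketch {
    if (!TelemetryClientProviderSketch.instance) {
      TelemetryClientProviderSketch.instance = new TelemetryClientProviderSketch();
    }
    return TelemetryClientProviderSketch.instance;
  }

  // Every DBSQLClient for the same host shares one client.
  acquire(host: string): HostTelemetryClient {
    let client = this.clients.get(host);
    if (!client) {
      client = new HostTelemetryClient(host);
      this.clients.set(host, client);
    }
    client.refCount += 1;
    return client;
  }

  // Shared state is torn down only when the last connection releases it.
  release(host: string): void {
    const client = this.clients.get(host);
    if (!client) return;
    client.refCount -= 1;
    if (client.refCount <= 0) {
      client.close();
      this.clients.delete(host);
    }
  }
}
```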

MetricsAggregator (per-host, on TelemetryClient)

Restored from main's hardened version, with new functionality layered on:

  • CONNECTION_CLOSE handling — emits DELETE_SESSION connection metric.
  • Chunk-timing aggregation: chunkInitialLatencyMs,
    chunkSlowestLatencyMs, chunkSumLatencyMs accumulated from
    CLOUDFETCH_CHUNK events with positive latency.
  • Memory bounds: maxPendingMetrics (drop preferring non-error to keep
    first-failure signal), maxErrorsPerStatement, statementTtlMs eviction.
  • Flush triggers: batch size, periodic timer (unref()'d), terminal-error
    immediate flush, manual flush().
  • close() is async and awaits the final HTTP POST so
    await client.close(); process.exit(0) doesn't truncate the last batch.
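The maxPendingMetrics drop policy above ("drop preferring non-error") can be illustrated with a toy bounded buffer. The metric shape and class name are assumptions; only the eviction preference mirrors the description:

```typescript
interface PendingMetric {
  isError: boolean;
  payload: string;
}

class BoundedPendingBuffer {
  private pending: PendingMetric[] = [];

  constructor(private readonly maxPendingMetrics: number) {}

  push(metric: PendingMetric): void {
    if (this.pending.length >= this.maxPendingMetrics) {
      // Prefer dropping a non-error metric so the first-failure
      // signal survives; drop an error only as a last resort.
      const idx = this.pending.findIndex((m) => !m.isError);
      this.pending.splice(idx >= 0 ? idx : 0, 1);
    }
    this.pending.push(metric);
  }

  snapshot(): PendingMetric[] {
    return [...this.pending];
  }
}
```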

Error telemetry wired into operation entry points

DBSQLOperation now emits ERROR events (with ExceptionClassifier
terminal/retryable classification) from fetchChunk, cancel, close,
and getMetadata. Failed queries produce a STATEMENT_COMPLETE plus an
ERROR proto with error_info: { error_name, stack_trace } (stack run
through redactSensitive).

emitStatementComplete no longer issues a getMetadata Thrift RPC on
close (perf regression + spurious-error-telemetry trap).
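A later iteration in this PR consolidates the per-entry-point wiring into a withErrorTelemetry helper; a minimal sketch of that shape (the signature and ErrorSink type are illustrative):

```typescript
type ErrorSink = (error: Error, isTerminal: boolean) => void;

// Wraps an operation body: on failure, emit an ERROR event with the
// terminal/retryable classification, then re-throw so caller-visible
// behavior is unchanged. Telemetry failures are swallowed.
async function withErrorTelemetry<T>(
  body: () => Promise<T>,
  emitError: ErrorSink,
  isTerminal: (e: Error) => boolean,
): Promise<T> {
  try {
    return await body();
  } catch (e) {
    const err = e instanceof Error ? e : new Error(String(e));
    try {
      emitError(err, isTerminal(err)); // must not mask the real error
    } catch {
      /* swallow: telemetry never throws into driver paths */
    }
    throw err;
  }
}
```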

Type-safe wiring (IClientContext)

Added optional getTelemetryEmitter() / getTelemetryAggregator() to
IClientContext. Removed all (this.context as any) casts at the seven
emit call sites (DBSQLOperation, DBSQLSession, RowSetProvider,
CloudFetchResultHandler).

The six copy-pasted listeners in DBSQLClient.initializeTelemetry are now
one bridge loop over Object.values(TelemetryEventType) — closes the
listener-name mismatch that originally caused error events to be silently
dropped.
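The bridge loop described above can be sketched as a single forwarding pass over the event-type values, so listener names can never drift from emit names. The event-type map and handleEvent signature below are illustrative:

```typescript
import { EventEmitter } from "node:events";

const TelemetryEventType = {
  ConnectionOpen: "connection.open",
  ConnectionClose: "connection.close",
  StatementStart: "statement.start",
  StatementComplete: "statement.complete",
  CloudFetchChunk: "cloudfetch.chunk",
  Error: "error",
} as const;

interface AggregatorLike {
  handleEvent(type: string, payload: unknown): void;
}

// One loop replaces N copy-pasted listeners: the forwarded name is, by
// construction, the same string the emitter fires.
function bridge(emitter: EventEmitter, aggregator: AggregatorLike): void {
  for (const type of Object.values(TelemetryEventType)) {
    emitter.on(type, (payload) => aggregator.handleEvent(type, payload));
  }
}
```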

mapAuthType covers all six authType values (access-token, databricks-oauth
{U2M/M2M}, custom, token-provider, external-token, static-token).

Verified end-to-end against an Azure Databricks workspace

Healthy SELECT 1 produces 3 wire metrics: CREATE_SESSION (with
system_configuration), STATEMENT_COMPLETE (with
sql_operation.execution_result), DELETE_SESSION.

Failed query produces 4: CREATE_SESSION, STATEMENT_COMPLETE (latency
only), ERROR (with redacted stack), DELETE_SESSION.

Server-side feature flag is still the kill switch — telemetryEnabled: false
on the client also skips the entire pipeline (no acquire/release noise).

Testing

484 unit tests pass (telemetry + DBSQLClient/Operation/Session/result).
Test files for the rebased modules are restored from main. Provider tests
updated for the singleton API.

Coverage gap acknowledged: no unit tests yet for the re-applied
chunk-timing aggregation or CONNECTION_CLOSE handling. Adding these as a
follow-up to the same epic.

Dependencies

Depends on:

Out of scope (open follow-ups)

  • Add operation_type proto field so CREATE_SESSION and DELETE_SESSION
    are explicitly distinguished on the wire (today they're distinguishable
    only by the incidental presence of system_configuration).
  • Add getStats() for telemetry-pipeline self-observability (drop counts,
    queue depth, last-success timestamp).
  • Expose remaining telemetry config knobs in ConnectionOptions (currently
    3 of 13).
  • Reunify connection.open/connection.close emission (open from DBSQLClient,
    close from DBSQLSession).
  • Unit tests for chunk-timing aggregation and CONNECTION_CLOSE.

@samikshya-db (Collaborator, Author) commented:

The emission format conforms to the telemetry proto; marking this ready for review.

samikshya-db and others added 11 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure
@samikshya-db samikshya-db force-pushed the telemetry-3-client-management branch from 87d1e85 to 32003e9 Compare January 29, 2026 20:21
samikshya-db and others added 11 commits January 29, 2026 20:21
This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy:
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)
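The retry schedule above can be sketched directly: exponential backoff from a 100ms base, 0-100ms jitter, and no retries for terminal statuses. Function names are illustrative; the constants come from the list above:

```typescript
const TERMINAL_STATUSES = new Set([400, 401, 403, 404]);
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

// Only explicitly retryable statuses are retried; everything else
// (terminal or unknown) fails fast.
function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.has(status) && !TERMINAL_STATUSES.has(status);
}

// attempt 0 → 100ms, attempt 1 → 200ms, attempt 2 → 400ms, plus 0-100ms
// jitter to avoid a thundering herd. Random source injectable for tests.
function backoffMs(attempt: number, random: () => number = Math.random): number {
  const base = 100 * 2 ** attempt;
  const jitter = Math.floor(random() * 100);
  return base + jitter;
}
```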

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]
Local-only e2e harness that duplicates what tests/e2e/telemetry/telemetry-integration.test.ts covers in CI.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Fixes the lint CI job which runs prettier --check before eslint.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@databricks databricks deleted a comment from github-actions Bot Apr 28, 2026
Aggregate initial/slowest/sum chunk latencies in MetricsAggregator,
emit them in the proto chunk_details block, and time each FetchResults
RPC in RowSetProvider so the inline-Arrow path populates chunk
telemetry alongside CloudFetch (mirrors Go's chunkTimingAccumulator).

Also fix MetricsAggregator clobbering accumulated chunkCount and
bytesDownloaded to 0 on STATEMENT_COMPLETE when the event omitted
those fields — this hid chunk_details from every path.

Co-authored-by: Isaac
@github-actions

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

- Snapshot driverConfig on each statement at first event so a later
  CONNECTION_OPEN can't retroactively rewrite the config reported by
  in-flight statements (and their buffered errors).
- Attach a defensive .catch() to the fire-and-forget exporter.export()
  call so any future regression that leaks a rejection logs at debug
  rather than surfacing as an unhandled promise rejection.
- Document the unref()'d flush timer on DBSQLClient.close(): callers
  must await close() on shutdown to drain buffered telemetry; otherwise
  metrics between flush ticks are lost.

Co-authored-by: Isaac

…iring, error telemetry

Rebase the regressed exporter/aggregator/feature-flag-cache on main's
hardened versions and re-apply only the genuinely new functionality
(CONNECTION_CLOSE event, chunk-timing aggregation) on top. Closes the
critical findings from the multi-reviewer audit:

  - SSRF guard, redactSensitive, sanitizeProcessName, hasAuthorization,
    auth-missing warn-once — all restored via main's telemetryUtils.
  - MetricsAggregator memory bounds (maxPendingMetrics with error-preferring
    drop, maxErrorsPerStatement, statementTtlMs eviction) restored.
  - FeatureFlagCache in-flight fetch dedup and TTL clamp [60s, 3600s]
    restored; lib/telemetry/urlUtils.ts deleted.
  - close() now properly awaits aggregator drain — fixes the close()/flush
    race that PR #362 already fixed once.
  - Driver version reads lib/version.ts via buildUserAgentString instead
    of hardcoded '1.0.0'; uuidv4() restored in place of Math.random().
  - TelemetryTerminalError re-exported from lib/index.ts.

Type-safe wiring:

  - Added optional getTelemetryEmitter() / getTelemetryAggregator() to
    IClientContext; removed all 7 `(this.context as any)` casts.
  - Six copy-pasted event listeners in DBSQLClient.initializeTelemetry
    collapsed into one `Object.values(TelemetryEventType)` loop — closes
    the listener-name mismatch that silently dropped error events.
  - mapAuthType now covers all 6 authType values instead of defaulting
    everything to 'pat'.

TelemetryClient now owns the host-scoped resources:

  - TelemetryClientProvider is a process-wide singleton (getInstance()).
  - TelemetryClient owns DatabricksTelemetryExporter, MetricsAggregator,
    CircuitBreakerRegistry, and FeatureFlagCache for its host. Implements
    IClientContext itself so the owned components have a stable context
    that survives any single DBSQLClient's close.
  - DBSQLClient instances on the same host share the breaker counters,
    feature-flag cache, exporter, and HTTP batches. Fixes the per-instance
    breaker-fragmentation noted in iter-2 architecture review.
  - Each DBSQLClient still holds its own TelemetryEventEmitter (respects
    per-client telemetryEnabled); emitters bridge into the shared aggregator.
  - Exporter falls back to context.getAuthProvider() when no explicit auth
    provider is passed, so the shared exporter resolves auth through the
    TelemetryClient's FIFO of registered DBSQLClients.

Error telemetry wired across operation entry points:

  - Re-added emitErrorEvent(error) on DBSQLOperation; uses
    ExceptionClassifier.isTerminal() to classify.
  - fetchChunk, cancel, close, getMetadata wrap their bodies in try/catch
    that calls emitErrorEvent before re-throwing. Verified end-to-end
    against a real Azure Databricks workspace: failed query produces
    STATEMENT_COMPLETE + ERROR (with redacted stack) on the wire.
  - Removed the await getMetadata() call from emitStatementComplete —
    eliminates the extra Thrift RPC on every close (F19) AND prevents
    spurious error telemetry from getMetadata's wrapper firing during
    close-cleanup of an already-failed operation.

Other:

  - Iterating Map.keys() while mutating made safe via snapshot in close().
  - STATEMENT_COMPLETE no longer zeroes accumulated chunk metrics when
    the emit doesn't supply them (matches sibling-field guards).
  - Tests for the rebased modules restored from main; provider tests
    updated for the singleton API; deleted unused TelemetryExporterStub.

484 unit tests passing. Diff vs main: ~+2110/-383, down from the
original PR's +3640/-1173.

Co-authored-by: Isaac

- Apply prettier to TelemetryClientProvider.test.ts (sed-edits in the
  prior commit didn't preserve formatting).
- Silence eslint `no-await-in-loop` on the auth-context fall-through
  in TelemetryClient.getConnectionProvider — sequential by intent.
- Drop the empty public constructor on TelemetryClientProvider; leave
  a comment explaining the singleton + test-friendly construction
  contract.

Co-authored-by: Isaac

Mocha tests need `function () {}` so they can use `this.timeout()` /
`this.skip()` — arrow functions don't bind `this` to the test context.
The `func-names` rule was firing on every test in the suite (including
pre-existing tests in `protocol_versions.test.ts`); moving the rule to
the test-file override block silences those warnings.

Co-authored-by: Isaac
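The `this`-binding difference behind this lint change can be demonstrated with a stand-in for the Mocha test context (the context object below is illustrative, not Mocha itself):

```typescript
interface TestContext {
  timeout(ms: number): number;
}

// Stand-in for the context Mocha binds as `this` in each test.
const mochaContext: TestContext = { timeout: (ms) => ms };

// A regular function picks up whatever `this` the caller binds,
// which is how `this.timeout()` / `this.skip()` work in Mocha tests.
function regularStyle(this: TestContext): number {
  return this.timeout(5000);
}

// An arrow function captures the enclosing lexical `this` instead,
// so a bound test context can never reach it.
const arrowStyle = (): string => {
  return "no access to the test context here";
};

const result = regularStyle.call(mochaContext);
const arrowResult = arrowStyle();
```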

…n, knobs

Iter-3 review fixes addressing 17 distinct findings from the multi-agent
review. Telemetry is now functionally correct and operationally safe.

Critical
- F1: TelemetryClient ctor wires getOrCreateContext on FeatureFlagCache.
  isTelemetryEnabled was previously short-circuiting to false in production
  because no caller registered the host — every customer silently emitted
  zero events.
- F2: integration test asserts the documented default (true), not the prior
  off-by-default. Test was contradicting production code.
- F3: IClientContext.getAuthProvider now optional; consumers use ?.() so
  external implementers don't break on upgrade.

High / privacy
- F4: explicit DATABRICKS_TELEMETRY_DISABLED parser (1/true/yes/on, case
  insensitive). Avoids the footgun where DATABRICKS_TELEMETRY_DISABLED=false
  also disabled telemetry. Documented in CHANGELOG and TSDoc.
- F12: TelemetryClient.registerContext warns on telemetry-config and
  userAgentEntry divergence so multi-tenant misconfig is visible.
- F9: connect()-on-reconnect releases prior refcount; close() clears the
  emitter ref so post-close events can't smuggle into a closed aggregator.
- M-1: redactSensitive strips /home/<user>/, /Users/<user>/, and
  C:\Users\<user>\ patterns from stack traces.
- M-3: FeatureFlagCache.getAuthHeaders falls through to the context's auth
  provider — feature-flag GET is no longer unconditionally unauth.
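An illustrative take on the M-1 redaction: strip home-directory prefixes from stack traces so usernames never reach the telemetry wire. The exact patterns and replacement token are assumptions, not the driver's real redactSensitive:

```typescript
// Replace the username segment of common home-directory layouts
// (/home/<user>/, /Users/<user>/, C:\Users\<user>\) with a placeholder.
function redactHomePaths(stack: string): string {
  return stack
    .replace(/\/home\/[^/\s]+\//g, "/home/<redacted>/")
    .replace(/\/Users\/[^/\s]+\//g, "/Users/<redacted>/")
    .replace(/C:\\Users\\[^\\\s]+\\/g, "C:\\Users\\<redacted>\\");
}
```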

Operational
- F7: MetricsAggregator.close races the final flush against a configurable
  telemetryCloseTimeoutMs (default 2s) so a flapping endpoint can't hang
  process.exit(0).
- F8: flushInFlight serializer prevents concurrent fire-and-forget flushes
  from starving the user's HTTP socket pool. Drain pattern in close()
  awaits any in-flight flush, then issues a fresh one to capture
  close-time metrics that would otherwise be stranded.
- F16: maxStatementMetrics cap (default 5000) with oldest-first eviction.
  Buffered errors emitted as standalone metrics first so the first-failure
  signal survives.
- DBSQLSession.close() emits connection.close even when closeSession
  fails so failed-close rates are visible in dashboards.
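The F7 close-timeout race can be sketched with Promise.race; the function name and default are taken from the description above, the rest is illustrative:

```typescript
// Race the final telemetry flush against telemetryCloseTimeoutMs so a
// flapping endpoint can't hang shutdown. Returns which side won.
async function closeWithTimeout(
  finalFlush: () => Promise<void>,
  telemetryCloseTimeoutMs = 2000,
): Promise<"flushed" | "timed-out"> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), telemetryCloseTimeoutMs);
  });
  try {
    return await Promise.race([finalFlush().then(() => "flushed" as const), timeout]);
  } finally {
    if (timer) clearTimeout(timer); // don't leave the timer holding the event loop
  }
}
```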

Maintainability
- F10/F17: single withErrorTelemetry helper covers fetchChunk, cancel,
  close, finished, hasMoreRows, getSchema, getMetadata. safeEmit helper
  consolidates seven copy-pasted "get emitter, emit, swallow at debug"
  blocks across DBSQLOperation, DBSQLClient, DBSQLSession,
  CloudFetchResultHandler, RowSetProvider. Also fixes the inconsistency
  where DBSQLSession.close() lacked the swallow wrapper that the other
  six sites had.

API surface
- F13: ConnectionOptions exposes nine telemetry knobs (was three) with
  TSDoc. Adds telemetryFlushIntervalMs, telemetryMaxRetries,
  telemetryCircuitBreakerThreshold, telemetryCircuitBreakerTimeout,
  telemetryCloseTimeoutMs, telemetryMaxStatementMetrics.

Tests
- ClientContextStub gains telemetryEmitter / telemetryAggregator hooks so
  unit tests can assert on emit calls instead of silently no-op'ing.
- 18 new unit tests covering F1 refcount, F12 divergence warn, async-close
  idempotency, error-telemetry wrappers (cancel, close, getMetadata,
  closed-op finished/getSchema/hasMoreRows), multi-context FIFO, and a
  new tests/unit/result/RowSetProvider.test.ts file (RowSetProvider had
  no test file at all). 783 unit tests pass; live e2e against
  adb-27363120558779.19.azuredatabricks.net validates the full pipeline.

Co-authored-by: Isaac

